17.5.2 Hidden Markov Models
Knowledge of the actual biological sequence of processing operations can be used to
exploit the effect of the constraints on (nucleic acid) sequence that these successive
processes imply. One presumes that the Markov binary symbol transition matrices
are slightly different for introns, exons, promoters, enhancers, the complementary
strand, and so forth. One constructs a more elaborate automaton, an automaton of
automata, in which the outer one controls the transitions between the different types
of DNA (introns, exons, etc.) and the inner set gives, for each type, the 16 (4 × 4)
binary (base-to-base) transition probabilities for the symbol sequence. More sophisticated models
use higher order chains for the symbol transitions; further levels of automata can
also be introduced. The epithet “hidden” is intended to signify that only transitions
from symbol to symbol are observable, not transitions from type to type. The main
problem is the statistical inadequacy of the predictions. A promoter may only have
two dozen bases; a fourth-order Markov chain for nucleotides has of the order of
$10^{10}$ transition probabilities.
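The automaton of automata described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a trained model: the restriction to two region types, every probability value, and the names TYPE_TRANSITIONS, BASE_TRANSITIONS, and generate are hypothetical and introduced here. The sketch uses first-order symbol transitions, so each region type carries 16 base-to-base probabilities.

```python
import random

BASES = "ACGT"

# Outer automaton: transitions between region types.
# Hypothetical values, chosen only so that regions persist for a while.
TYPE_TRANSITIONS = {
    "exon":   {"exon": 0.95, "intron": 0.05},
    "intron": {"exon": 0.02, "intron": 0.98},
}

# Inner automata: one 4 x 4 base-to-base matrix per region type (16 entries).
# Row = current base, columns = next base in the order A, C, G, T.
# Hypothetical values; each row sums to 1.
BASE_TRANSITIONS = {
    "exon": {
        "A": [0.30, 0.20, 0.30, 0.20],
        "C": [0.25, 0.25, 0.30, 0.20],
        "G": [0.25, 0.30, 0.25, 0.20],
        "T": [0.20, 0.25, 0.30, 0.25],
    },
    "intron": {
        "A": [0.35, 0.15, 0.15, 0.35],
        "C": [0.30, 0.20, 0.10, 0.40],
        "G": [0.30, 0.20, 0.20, 0.30],
        "T": [0.25, 0.15, 0.20, 0.40],
    },
}


def generate(length, seed=0):
    """Return (symbols, types): the observable sequence and the hidden path."""
    rng = random.Random(seed)
    region = "exon"            # arbitrary starting type (assumption)
    base = rng.choice(BASES)   # uniform choice of the first base (assumption)
    symbols, types = [base], [region]
    for _ in range(length - 1):
        # Outer step: possibly switch region type (not observable).
        row = TYPE_TRANSITIONS[region]
        region = rng.choices(list(row), weights=list(row.values()))[0]
        # Inner step: emit the next base using the matrix of the current type.
        base = rng.choices(BASES, weights=BASE_TRANSITIONS[region][base])[0]
        symbols.append(base)
        types.append(region)
    return "".join(symbols), types


if __name__ == "__main__":
    seq, path = generate(60, seed=1)
    print(seq)                                 # what an observer sees
    print("".join(t[0] for t in path))         # hidden types: e = exon, i = intron
```

Only the first printed line corresponds to what is observable; the second line, the sequence of region types, is the "hidden" part that one would have to infer from the symbols alone.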
Problem. Construct a hidden Markov model for the mitogen-activated protein kinase
signalling cascade (Sect. 18.7).
17.6 Minimalist Approaches to Deciphering DNA
The inspiration for this approach is the study of texts written in human languages. A
powerful motivation for the development of linguistics as a formal field of inquiry was
the desire to understand texts written in “lost” languages (without living speakers),
especially those of antiquity, records of which began pouring into Europe as a result
of the large-scale expeditions to Egypt, Mesopotamia, and elsewhere undertaken in
the nineteenth and twentieth centuries. More recently, linguistics has been driven by
attempts to automatically translate texts written in one language into another.
One of the most obvious differences between DNA sequences and texts written
in living languages is that the former lack separators between the words (denoted
by spaces in most of the latter). Furthermore, unambiguous punctuation marks generally
enable phrases and sentences in living languages to be clearly identified. Even
with this invaluable information, however, matters are far from determined, and the
study of the morphology of words and the rules that determine their association into
sentences (syntax)—that is, grammar—is a large and active research field.
For DNA that is ultimately translated into protein sequences, the nucleic acid–base
pairs are grouped into triplets constituting the reading frames, each triplet corresponding
to one amino acid. A further peculiarity of DNA compared with human languages
is that reading frames may overlap; that is, from the sequence AAGTTCTG… one
may derive the triplets AAG, AGT, GTT, TTC, …. This is encountered in certain